One of the most basic problems in data science goes something like this:
I want to fit a curve that is flexible enough to capture non-linear relationships (i.e. to pass through known data points without error), but I also want confidence intervals for predictions at unknown points.
Most methods give you one or the other. You can go with linear or polynomial regressions and get confidence intervals, but they won't perfectly fit your data. Or you can go with splines, LOESS and other non-parametric methods, but then you won't get confidence intervals out of the box (you can still estimate them with bootstrapping, but it's not always straightforward).
It was in this context that I first learned about Gaussian Processes (GPs). GPs give you a perfect interpolating function over known data points while also providing variance estimates at unknown data points. Here's a very basic (and recognizable) illustration of a GP in action.
Parametric algorithms typically fit a fixed number of parameters to data (think linear regression and its coefficients). Non-parametric methods, on the other hand, typically use data points themselves as "parameters" (think K-nearest neighbours where classification depends solely on existing data points). Gaussian Processes fall somewhere in between. Instead of fitting a fixed number of parameters, they fit functions to data, where the functions are drawn from a multivariate normal distribution and thus defined only by the mean and variance. One of the best definitions I found is from Julia's GP package:
> Gaussian processes are a family of stochastic processes which provide a flexible nonparametric tool for modelling data. A Gaussian Process places a prior over functions, and can be described as an infinite dimensional generalisation of a multivariate Normal distribution. Moreover, the joint distribution of any finite collection of points is a multivariate Normal. This process can be fully characterised by its mean and covariance functions, where the mean of any point in the process is described by the mean function and the covariance between any two observations is specified by the kernel. Given a set of observed real-valued points over a space, the Gaussian Process is used to make inference on the values at the remaining points in the space.
In some ways, GPs thus incorporate both aspects - they have fixed parameters (the mean and the variance of the prior), but they also use data as parameters (the posterior over functions is drawn from an n-dimensional Gaussian distribution, where n is the number of sample points). If you are interested in understanding GPs more deeply (and I definitely suggest that), I highly recommend a tutorial on YouTube by Richard Turner and a visual exploration of Gaussian processes by three researchers from the University of Konstanz.
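To make the "distribution over functions" idea concrete, here is a small numpy-only sketch (not part of the original analysis): evaluate a kernel on a grid of inputs and sample from the resulting multivariate normal - each sample is one "function". The RBF kernel and its unit hyperparameters are illustrative assumptions.

```python
import numpy as np

# Illustrative RBF kernel with assumed unit lengthscale and variance
def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

x = np.linspace(0, 5, 50)
# Kernel matrix over the grid, with a small jitter for numerical stability
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))

# Each draw from this multivariate normal is one sampled "function"
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
print(samples.shape)  # (3, 50): three functions evaluated at 50 grid points
```

Plotting the rows of `samples` against `x` gives exactly the wiggly curves you see in most GP illustrations.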
Finally, I cannot leave out two wonderful memes by Richard McElreath...
*(embedded tweet: Richard McElreath, @rlmcelreath, February 21, 2022 - pic.twitter.com/wbDo3CqpTK)*
If you are wondering how one can fit an "infinite-dimensional" normal distribution, the answer lies in the fact that the marginal distribution of a multivariate normal is itself normal, as are the joint and conditional distributions of jointly normal variables. Richard Turner does a really great job of illustrating these dynamics in his tutorial.
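These are the standard textbook conditioning identities (not code from this notebook), but seeing them in two lines of numpy helps: for a joint Gaussian $[x_1, x_2]$, conditioning on $x_2 = a$ yields another Gaussian whose mean shifts with $a$ and whose variance shrinks. GP prediction is this same formula with the kernel supplying the covariances.

```python
import numpy as np

# Joint Gaussian over (x1, x2) with correlation 0.8
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

a = 1.5  # observed value of x2

# Conditional x1 | x2 = a:
#   mean = mu1 + S12 / S22 * (a - mu2)
#   var  = S11 - S12^2 / S22
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (a - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(cond_mean, cond_var)  # 1.2 0.36
```

Note how the conditional variance (0.36) is smaller than the prior variance (1.0): observing a correlated variable reduces uncertainty, which is exactly where GP confidence intervals come from.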
For all their coolness, the GPs' Achilles' heel is computational complexity. Like other non-parametric methods, they scale poorly with the amount of data - most implementations are $O(n^3)$ in the number of observations. Improving that is an active line of research; everyday data scientists concerned with speed can choose between the GPU-enabled [GPyTorch](https://gpytorch.ai/) and the efficiency-focused celerite libraries.
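To see where the $O(n^3)$ comes from, here is a minimal numpy sketch of exact GP regression (an illustrative toy with assumed unit hyperparameters, not code from this notebook): the dominant cost is factorizing the $n \times n$ kernel matrix, after which each prediction is cheap.

```python
import numpy as np

X = np.linspace(0, 5, 20)  # n = 20 training inputs
y = np.sin(X)              # training targets

# n x n RBF kernel matrix over the training inputs
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)

# The O(n^3) bottleneck: factorize the kernel matrix (jitter for stability)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

# The posterior mean at a new point is then just an O(n) dot product
x_star = 2.2
k_star = np.exp(-0.5 * (x_star - X) ** 2)
print(k_star @ alpha)  # close to sin(2.2)
```

The fast libraries mentioned above attack exactly that factorization step, via GPU-parallel linear algebra or structured kernels.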
With that said, this post/notebook will focus on a few applications of Gaussian processes:

- spatial interpolation (kriging),
- time series forecasting,
- multi-output regression,
- Bayesian optimization with GPs as surrogate functions.
One of the earliest applications of Gaussian Processes is known as kriging, named after Danie G. Krige, a mining engineer from South Africa. Staying close to the roots, this notebook uses weather forecasts in Lithuania to illustrate GPs (and it makes for pretty charts, too!). An extract of the dataset is displayed below. This notebook uses the GPy library. If you are reading it in post format, some of the code cells are excluded for brevity. You can find the entire notebook and associated data on GitHub.
```python
# HIDE
import imp
import pandas as pd
import geopandas as gpd
import MeteoAPI

def collect_data():
    imp.reload(MeteoAPI)
    #setup the API object
    meteoApi = MeteoAPI.MeteoAPI(use_cache=True, cache_dir="meteo-cache")
    #get all available places
    places = meteoApi.get_places()
    #get all forecasts and save as a single dataframe
    all_forecasts = [meteoApi.get_forecast(p) for p in places]
    merged_forecasts = pd.concat(all_forecasts)
    #remove columns that are not interesting for us
    drop_cols = ['seaLevelPressure', 'relativeHumidity', 'windDirection', 'windGust', 'conditionCode']
    forecasts = merged_forecasts.drop(drop_cols, axis=1)
    #clip to Lithuania's borders and save
    lithuania = gpd.read_file('lt-shapefile/lt.shp')
    lt_forecasts = gpd.clip(forecasts, lithuania).reset_index(drop=True)
    lt_forecasts.to_file("lt_forecasts.geojson", driver='GeoJSON', index=False)

#collect_data()
```
```python
#read in weather forecasts and Lithuania's shape file
lt_forecasts = gpd.read_file("lt_forecasts.geojson")
lithuania = gpd.read_file('lt-shapefile/lt.shp')
lt_forecasts.head(n=10)
```
| | forecastTimeUtc | airTemperature | windSpeed | cloudCover | totalPrecipitation | location | geometry |
|---|---|---|---|---|---|---|---|
| 0 | 2022-02-01T06:00:00 | -2.8 | 4.0 | 18.0 | 0.0 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 1 | 2022-02-04T21:00:00 | 1.7 | 5.0 | 100.0 | 1.2 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 2 | 2022-02-06T18:00:00 | 2.4 | 7.0 | 97.0 | 3.5 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 3 | 2022-02-06T12:00:00 | 2.1 | 5.0 | 100.0 | 1.0 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 4 | 2022-02-06T06:00:00 | 0.3 | 7.0 | 100.0 | 0.9 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 5 | 2022-02-06T00:00:00 | 0.3 | 5.0 | 76.0 | 1.2 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 6 | 2022-02-05T18:00:00 | 0.1 | 5.0 | 43.0 | 0.0 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 7 | 2022-01-31T05:00:00 | -1.3 | 9.0 | 12.0 | 0.0 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 8 | 2022-01-31T06:00:00 | -1.5 | 9.0 | 9.0 | 0.0 | Senoji Radiškė | POINT (23.25380 54.29488) |
| 9 | 2022-01-31T07:00:00 | -1.7 | 9.0 | 4.0 | 0.0 | Senoji Radiškė | POINT (23.25380 54.29488) |
Let's start with the classical application of Gaussian Process in geostatistics. Imagine you are building an application that requires knowing air temperature at any given location in Lithuania. What you have, though, is just the measurements at weather stations. Here's how it looks visually.
```python
#pick a single forecast time
fc_time = lt_forecasts.loc[0, 'forecastTimeUtc']
morning_forecast = lt_forecasts[lt_forecasts['forecastTimeUtc'] == fc_time].reset_index(drop=True)
```
Let's use Gaussian Process regression to interpolate. We will fit a GP using coordinates as features and air temperature as the dependent variable.
```python
#organize data into predictors and the dependent variable
coordinates = np.zeros((len(morning_forecast), 2))
coordinates[:,0] = morning_forecast.geometry.x.values
coordinates[:,1] = morning_forecast.geometry.y.values
actual_temperatures = morning_forecast['airTemperature'].values.reshape(-1, 1)

#fit the simplest GP model with default parameters
interpolation_model = GPy.models.GPRegression(X=coordinates, Y=actual_temperatures)
interpolation_model.optimize_restarts(num_restarts=1, verbose=False)

#define the grid covering Lithuania
x_space = np.linspace(coordinates[:,0].min(), coordinates[:,0].max(), 100)
y_space = np.linspace(coordinates[:,1].min(), coordinates[:,1].max(), 100)
xx, yy = np.meshgrid(x_space, y_space)
grid = np.concatenate((xx.reshape(-1,1), yy.reshape(-1, 1)), axis=1)

#obtain predictions over the grid - GPy returns both mean and variance
p_means, p_vars = interpolation_model.predict(grid)
```
And, here it is - now we have predictions over the entire country. Looks pretty, doesn't it?
```python
#collect predictions into a GeoPandas dataframe
points = gpd.points_from_xy(grid[:,0], grid[:,1], crs='EPSG:4326')
interpolated = gpd.GeoDataFrame(p_means, columns=['temp'], geometry=points)
interpolated['scaled_temp'] = (p_means - p_means.min()) / (p_means.max() - p_means.min())
lt_interpolated = gpd.clip(interpolated, lithuania).reset_index(drop=True)

#plot as a heatmap
heatLayer = pdk.Layer(
    'HeatmapLayer',
    lt_interpolated,
    opacity=0.5,
    get_position="geometry.coordinates",
    aggregation=pdk.types.String("MEAN"),
    color_range=[[c*255 for c in colorscale(i).clamped_rgb] for i in [0, 0.5, 1]],
    threshold=1,
    get_weight="scaled_temp",
    pickable=True,
)
r = pdk.Deck(layers=[heatLayer, point_layer], initial_view_state=lt_view)
r.to_html()
```
Pretty, sure, but are these predictions any good? Let's test it out. I will split the dataset into a train and test set, and we'll measure the prediction accuracy by mean absolute error (MAE).
```python
#split the dataset
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import mean_absolute_error

train_coords, test_coords, train_temps, test_temps = tts(coordinates, actual_temperatures, random_state=42)

#small helper function to get training and test set errors
def get_errors(models, X_train, X_test, Y_train, Y_test):
    results = []
    for label, model in models.items():
        training_error = mean_absolute_error(Y_train, model.predict(X_train)[0])
        test_error = mean_absolute_error(Y_test, model.predict(X_test)[0])
        results.append((label, training_error, test_error))
    return pd.DataFrame(results, columns=['model', 'Training MAE', 'Test MAE'])

#let's create an initial model and test its accuracy
models = {
    'basic-GP': GPy.models.GPRegression(X=train_coords, Y=train_temps)
}
models['basic-GP'].optimize_restarts(num_restarts=1, verbose=False)
get_errors(models, train_coords, test_coords, train_temps, test_temps)
```
| | model | Training MAE | Test MAE |
|---|---|---|---|
| 0 | basic-GP | 0.095522 | 0.122836 |
Looks pretty good to me! It's off by 0.1 degree (Celsius) on average in a dataset where temperature ranges from $-7.9$ to $-0.3$ degrees. But there is more. Remember that one of the unique aspects of a Gaussian Process is that we also get confidence (variance) over the predictions. Let's see how accurate the confidence intervals are.
```python
#get predictions for the test set
test_means, test_vars = models['basic-GP'].predict(test_coords)

#list of confidence intervals of interest
conf_ints = pd.DataFrame([0.5, 0.7, 0.9, 0.95, 0.99], columns=['confidence level'])
conf_ints['total observations'] = len(test_means)

#helper to find # of observations falling into a confidence interval
def count_in_range(cint):
    factor = -norm.ppf((1 - cint) / 2)
    side_range = np.sqrt(test_vars) * factor
    return np.sum((test_temps <= (test_means + side_range)) & (test_temps >= (test_means - side_range)))

#calculate them all
conf_ints['observations in interval'] = conf_ints['confidence level'].apply(count_in_range)
conf_ints['% observations in interval'] = conf_ints['observations in interval'] / conf_ints['total observations']
conf_ints
```
| | confidence level | total observations | observations in interval | % observations in interval |
|---|---|---|---|---|
| 0 | 0.50 | 513 | 290 | 0.565302 |
| 1 | 0.70 | 513 | 393 | 0.766082 |
| 2 | 0.90 | 513 | 479 | 0.933723 |
| 3 | 0.95 | 513 | 495 | 0.964912 |
| 4 | 0.99 | 513 | 509 | 0.992203 |
It seems that in our case, the confidence intervals produced by the Gaussian Process are, if anything, conservative!
You may have noticed that I have used all the defaults of the GPy regression module, and may be curious whether even better performance can be achieved by tuning some of the hyperparameters. Indeed, it can. By far the most important hyperparameter in Gaussian Processes is the kernel function that defines the correlation between two data points (the default is the radial basis function, $R(h) = e^{-\theta h^2}$). GPy has a number of kernels implemented, and even allows defining custom kernels. They have an entire notebook dedicated to exploring kernels which is definitely worth skimming through.
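Two handy facts when experimenting with kernels: a kernel matrix must be symmetric and positive semi-definite, and valid kernels are closed under addition and multiplication - which is why combinations like Matern52 + Linear are themselves valid kernels. Here is a numpy-only sketch of the default RBF formula above (with an arbitrarily assumed $\theta = 1$), plus a quick check of those properties:

```python
import numpy as np

#the RBF kernel from the formula above: R(h) = exp(-theta * h^2)
def rbf(X1, X2, theta=1.0):
    h = X1[:, None] - X2[None, :]
    return np.exp(-theta * h ** 2)

#an (unscaled) linear kernel over the same 1D inputs
def linear(X1, X2):
    return X1[:, None] * X2[None, :]

X = np.linspace(0.0, 1.0, 5)
K = rbf(X, X) + linear(X, X)  # a sum of kernels is again a kernel

#sanity checks: symmetric and positive semi-definite
print(np.allclose(K, K.T))                    # True
print(np.all(np.linalg.eigvalsh(K) > -1e-9))  # True
```

GPy's overloaded `+` and `*` on kernel objects do the same thing behind the scenes.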
Let's see what we can achieve in our interpolation task by using custom kernels. Specifically, let's try out:
```python
matern_kernel = GPy.kern.Matern52(input_dim=2)
linear_kernel = GPy.kern.Linear(input_dim=2)

models['Matern52 kernel'] = GPy.models.GPRegression(X=train_coords, Y=train_temps, kernel=matern_kernel)
models['Matern52 kernel'].optimize_restarts(num_restarts=1, verbose=False)

models['Matern52 + linear kernel'] = GPy.models.GPRegression(X=train_coords, Y=train_temps, kernel=matern_kernel + linear_kernel)
models['Matern52 + linear kernel'].optimize_restarts(num_restarts=1, verbose=False)

get_errors(models, train_coords, test_coords, train_temps, test_temps)
```
| | model | Training MAE | Test MAE |
|---|---|---|---|
| 0 | basic-GP | 0.095522 | 0.122836 |
| 1 | Matern52 kernel | 0.055417 | 0.099285 |
| 2 | Matern52 + linear kernel | 0.044996 | 0.097688 |
And, indeed, the performance is better - choosing the kernels reduced training MAE by more than 50% and test MAE by about 20%. So, if there is one thing to remember: make sure to test different kernels (and see if you can reason about which kernel combinations may be most appropriate given your data).
The flexibility of kernel functions is what makes GPs useful for time series forecasting tasks, too. However, you won't get far with the default RBF kernel, as its predictions converge to zero (or the mean, if a bias kernel is added) very quickly once you step outside the domain observed during training. Here's an illustration using the 1D example from earlier.
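A numpy-only sketch of why that happens (illustrative, with assumed unit hyperparameters): far from the training inputs, the cross-covariance vector $k(x_*, X)$ vanishes, dragging the posterior mean back to the prior mean of zero.

```python
import numpy as np

X = np.linspace(0, 10, 30)  # training inputs
y = np.sin(X)               # training targets

#RBF kernel matrix with a small jitter for stability
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-6 * np.eye(len(X))
alpha = np.linalg.solve(K, y)

def posterior_mean(x_star):
    k_star = np.exp(-0.5 * (x_star - X) ** 2)  # -> 0 far from the data
    return k_star @ alpha

print(posterior_mean(5.0))   # inside the data: close to sin(5)
print(posterior_mean(50.0))  # far outside: essentially 0
```

A periodic or linear kernel, by contrast, keeps $k(x_*, X)$ informative arbitrarily far away, which is what makes extrapolation possible.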
Let's illustrate the right way to do it using one of the city temperature forecasts. We will take the first 48 hours of predictions and see if we can extend it.
```python
#get the relevant data for the Kaunas forecast
point_forecast = lt_forecasts[lt_forecasts['location'] == "Kaunas"].copy()
point_forecast['forecastTimeUtc'] = pd.to_datetime(point_forecast['forecastTimeUtc'])
point_forecast['hours_offset'] = (point_forecast['forecastTimeUtc'] - point_forecast['forecastTimeUtc'].min()).astype("timedelta64[h]")
point_forecast = point_forecast.sort_values(['forecastTimeUtc'])
first_two_days = point_forecast.head(48)
```
```python
alt.Chart(first_two_days).mark_line().encode(
    x = alt.X("yearmonthdatehours(forecastTimeUtc):O", title="Time"),
    y = "airTemperature"
).properties(
    title="48-hour temperature forecast in Kaunas",
    width=600
)
```
Given the cyclical nature of temperature during a day, let's use a periodic exponential kernel combined with a linear kernel (to account for trends).
```python
#get the variables
hours_offset = first_two_days['hours_offset'].values.reshape(-1, 1)
temperature = first_two_days['airTemperature'].values.reshape(-1, 1)

#run the prediction
forecast_kernel = (GPy.kern.PeriodicExponential(input_dim=1) + GPy.kern.Linear(input_dim=1))
forecast_model = GPy.models.GPRegression(X=hours_offset, Y=temperature, kernel=forecast_kernel, normalizer=True)
forecast_model.optimize_restarts(num_restarts=1, verbose=False)
plot_temp_forecast(forecast_model)
```
The predictions appear quite reasonable - the GP definitely picked up the cyclical pattern, and there seems to be a linear downward trend, too. But there's something fishy. So far, I have been showing how GPs perfectly interpolate existing data, and that is clearly not the case here. Furthermore, what is up with the variances? They seem to be constant over time, which is weird, too. What is going on?
Well, the answer is that the GP optimization process is stochastic, and a single attempt did not yield good results in this case. If we rerun the optimization process multiple (say, 10) times, we get much better results. This is another takeaway about GPs - in most cases, you may need to re-initialize the training multiple times to get what you are looking for.
At the same time, looking at the chart below you may reasonably ask - is this not overfitting? It sure looks like it to me. How do you go about reducing the overfitting? Recall that GPs are defined over a prior of functions. I have not provided any priors to GPy, but it is possible to do so, just as it is possible to introduce constraints. You can find more detail in one of the GPy tutorials.
```python
#rerun optimization 10 times
forecast_model.optimize_restarts(num_restarts=10, verbose=False)
plot_temp_forecast(forecast_model)
```
Another cool aspect of Gaussian Processes is that they can model relationships between multiple output variables. I have to admit that the maths behind how exactly this works is beyond me, though I suspect it has to do with the fact that the joint distribution of multiple normal distributions is itself normal, so one can model the joint distribution as a single output and then marginalize it as needed.
Continuing with the weather data examples, suppose we had observations from roughly 1,500 stations in total, where 1,000 stations measure air temperature, 1,000 measure wind speed, and roughly 500 of them measure both.
Further, let's assume that those stations are distributed along the south-north direction (i.e. the temperature ones tend to be up north, the wind speed ones down south, and the stations able to measure both are thus likely to be in the middle part of the country). This is simply to reduce the overlap between the two variables in the input space, making the prediction task more difficult.
If we needed to build models to predict temperature and wind speed at other locations, we could simply build two models independently, using 1000 data points each. But we may have reason to believe that temperature and wind speed are related, and capturing that relationship in the model may be beneficial. Let's see if that is the case.
To establish the baseline, let's first train GP regression for the two output variables separately and assess the test error.
```python
#let's use the 25% and 75% quantiles as south/north centers
south_center, north_center = np.quantile(morning_forecast.geometry.y.values, q=[0.25, 0.75])
sd = np.std(morning_forecast.geometry.y.values) * 2

#use two normal distributions to generate probability densities
north_dens = norm.pdf(morning_forecast.geometry.y.values, loc=north_center, scale=sd)
south_dens = norm.pdf(morning_forecast.geometry.y.values, loc=south_center, scale=sd)

#normalize densities to probabilities
prob_of_north = north_dens / (north_dens + south_dens)

#sample stations that have temperature / wind speed measurements
temp_stations = morning_forecast.sample(n=1000, weights=prob_of_north)
speed_stations = morning_forecast.sample(n=1000, weights=(1 - prob_of_north))

#check overlap
print("Number of stations with both measurements: {}".format(len(set(speed_stations.index).intersection(set(temp_stations.index)))))
```
Number of stations with both measurements: 493
```python
#split into train and test sets
train_temp, test_temp, train_speed, test_speed = tts(temp_stations, speed_stations, random_state=72, train_size=800)

#organize data into predictors and dependent variables
y_variables = ['airTemperature', 'windSpeed']
no_y_variables = len(y_variables)

#prepare a list of coordinates/obs for training (kept as a list of 2)
training_coords = [
    np.vstack([train_temp.geometry.x.values, train_temp.geometry.y.values]).T,
    np.vstack([train_speed.geometry.x.values, train_speed.geometry.y.values]).T
]
training_obs = [
    train_temp[y_variables[0]].values.reshape(-1, 1),
    train_speed[y_variables[1]].values.reshape(-1, 1),
]

#prepare coordinates/obs for the (joint) test set
test_coords = np.vstack([
    np.vstack([test_temp.geometry.x.values, test_temp.geometry.y.values]).T,
    np.vstack([test_speed.geometry.x.values, test_speed.geometry.y.values]).T
])
test_obs = np.vstack([
    test_temp[y_variables].values,
    test_speed[y_variables].values,
])
```
```python
#kernels
K1 = GPy.kern.Bias(2)
K2 = GPy.kern.Linear(2)
K3 = GPy.kern.Matern32(2)

#train independent models
result_bag = []
kernel = K1 * K3 + K3 * K2
for i, name in enumerate(y_variables):
    train_x = training_coords[i]
    train_y = training_obs[i]
    test_x = test_coords
    test_y = test_obs[:, i, None]

    singleOutput_model = GPy.models.GPRegression(X=train_x, Y=train_y, kernel=kernel)
    singleOutput_model.optimize_restarts(num_restarts=1, verbose=False)

    preds = singleOutput_model.predict(test_x)
    test_error = mean_absolute_error(test_y, preds[0])
    result_bag.append((name, 'single-output', test_error))

pd.DataFrame(result_bag, columns=['Variable', 'Model type', 'Test MAE'])
```
| | Variable | Model type | Test MAE |
|---|---|---|---|
| 0 | airTemperature | single-output | 0.111711 |
| 1 | windSpeed | single-output | 0.265519 |
Now, let's build a multi-output GP regression. The GPy model type for such scenarios is GPCoregionalizedRegression, and it has a somewhat different API than most other GPy models (you need to pass inputs as a list, one per output variable, and predictions require an extra index column in the input array). I based the code below on their dedicated tutorial on coregionalized regression.
```python
#define a multi-output kernel
icm = GPy.util.multioutput.LCM(input_dim=2, num_outputs=2, kernels_list=[K1, K2, K3])

#set up the model and optimize
multiOutput_model = GPy.models.GPCoregionalizedRegression(X_list=training_coords, Y_list=training_obs, kernel=icm)
multiOutput_model.optimize_restarts(num_restarts=5, verbose=False)

#set up the array used for predictions; it uses an extra column to indicate the output index to predict
inds = np.ones((test_coords.shape[0], 1))
predSpace = np.vstack([np.hstack([test_coords, inds * i]) for i in range(no_y_variables)])
noise_dict = {'output_index': predSpace[:, 2].astype(int)}

#get all predictions and evaluate
obs_preds = multiOutput_model.predict(predSpace, Y_metadata=noise_dict)
result_bag = []
for i, name in enumerate(y_variables):
    test_y = test_obs[:, i, None]
    pred_y = obs_preds[0][predSpace[:, 2] == i]
    test_error = mean_absolute_error(test_y, pred_y)
    result_bag.append((name, 'multi-output', test_error))

pd.DataFrame(result_bag, columns=['Variable', 'Model type', 'Test MAE'])
```
So far, I have explored use cases where the GP is the main model used for fitting. The flexibility of GPs, however, also makes them very popular candidates for use as surrogate functions. Imagine you have a model that takes a long time to run - perhaps it is something as simple as an XGBoost classifier on a relatively large dataset, or maybe it is a complex physics simulation of the universe that takes a day to run on a supercomputer cluster. You can only make a few runs with different (hyper-)parameters, and you would like to find the optimal setting of those hyperparameters (e.g. tune a learning rate for XGBoost) or use the limited outputs to interpolate over the rest of the input space. In such scenarios, you can do the following:

1. Evaluate the expensive model at a few initial parameter settings.
2. Fit a Gaussian Process to the (parameters, output) pairs collected so far.
3. Use the GP's mean and variance predictions to pick the most promising next parameter setting.
4. Evaluate the model there, add the result to the dataset, and repeat from step 2.
Effectively, the variance information produced by Gaussian Processes allows us to trade off between exploration and exploitation of the parameter space. There are multiple ways to determine the next point for evaluation; one of them is the Expected Improvement maximization approach. I won't delve into its details here - I highly recommend a post by Martin Krasser on Bayesian Optimization if you are curious about the details.
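For reference, the standard Expected Improvement acquisition for a maximization problem fits in a few lines. This is a generic sketch of the formula (with hypothetical candidate values), not the internals of any particular package:

```python
import numpy as np
from scipy.stats import norm

#Expected Improvement for maximization: trades off a high predicted
#mean (exploitation) against high predictive uncertainty (exploration);
#xi is an optional exploration margin
def expected_improvement(mu, sigma, best_so_far, xi=0.0):
    sigma = np.maximum(sigma, 1e-12)  # guard against zero variance
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

#three hypothetical candidate points and their GP predictions
mu = np.array([0.5, 0.9, 0.2])
sigma = np.array([0.1, 0.01, 0.5])
ei = expected_improvement(mu, sigma, best_so_far=0.8)
print(ei.argmax())  # index of the candidate the optimizer evaluates next
```

Note how the third candidate, despite a low predicted mean, still earns a non-trivial EI purely from its large variance - that is the exploration term at work.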
In this notebook, I won't fit XGBoost models. Instead, I will illustrate the concept with a simpler task. Let's pretend our task is to find the coldest place in Lithuania as measured by wind chill factor (which happens to be a village called Kirmėliai, with a wind chill factor of -11.58). Here's how the distribution of the chill factor looks geographically; you can see it is not a convex surface, so gradient-based approaches are unlikely to get us to the right place.
Instead, let's use Bayesian optimization with Gaussian Processes. There are a few Python packages that support Bayesian optimization, most notably hyperopt and pyGPGO. Hyperopt uses a tree of Parzen estimators (TPE) as the surrogate function, however. For this illustration, I chose to go with the aptly named BayesianOptimization package, which uses Gaussian Processes and supports the Expected Improvement algorithm for iterative choices.
Let's start by randomly choosing 4 points in Lithuania and evaluating the chill factor there (we'll use the nearest station's measurement for any given point). Then, let's explore an additional 15 locations iteratively, where each subsequent location is chosen by the Expected Improvement criterion using a Gaussian Process model fit on all observations collected so far.
```python
from bayes_opt import BayesianOptimization

#set the bounds to be the bounding box of Lithuania
param_bounds = {
    'lat': (bounds[0], bounds[2]),
    'lon': (bounds[1], bounds[3])
}

#set up the optimizer
optimizer = BayesianOptimization(f=get_chill_factor, pbounds=param_bounds, random_state=42, verbose=False)

#let's go exploring!
optimizer.maximize(init_points=4, n_iter=15, acq='ei', xi=0)
```
What are the results? It turns out we were able to find the coldest place! To get some intuition for what happened, here's a graphical illustration of the algorithm's path. You can see how the trade-off between exploring unknown areas and exploiting the information collected so far around the maximum observed value changes as more and more data points are collected. Much better than any grid search!
```python
explorations = gpd.GeoDataFrame([{'target': d['target'],
                                  'lat': d['params']['lat'],
                                  'lon': d['params']['lon']} for d in optimizer.res])

#convert lat/lon into a geometry column
points = [Point(lon, lat) for i, lat, lon in explorations[['lon', 'lat']].itertuples()]
explorations = explorations.assign(**{'geometry': gpd.GeoSeries(points).set_crs('EPSG:4326')})
explorations = explorations.drop(['lat', 'lon'], axis=1)

print("Lowest chill factor found: {:.5f}".format(-explorations['target'].max()))
```
Lowest chill factor found: -11.58668
If you got here - thank you for reading and I hope this journey was interesting!
To be fair, I feel like this is just scratching the surface of what one can do with Gaussian Processes, and there are all kinds of crazy interesting things out there (deep Gaussian Processes, for example, where GPs are "stacked" on top of each other, eliminating the need to manually specify kernels). Most of them are beyond my ability to comprehend, but perhaps they won't be for you! At the same time, a basic understanding of GPs and their value as flexible fitting functions that provide you with confidence estimates already gets you pretty far. Here's an example from Lyft where they use Gaussian Processes and Bayesian optimization to run experiments.
On a final note, if you happen to be Lithuanian, here's an interesting fact: it turns out that a Lithuanian mathematician, Jonas Mockus, was a very influential figure in the Bayesian optimization space. I was pleasantly surprised to see a familiar name while researching for this post.